“Using dplyr to join data tables”
This document outlines some common data joining tasks with the dplyr package. If you’ve ever Googled ‘how to perform [x] in R’, you’ve probably discovered there are multiple ways to accomplish [x]. The methods I present here are not exhaustive, and I’ve tried to balance the trade-off between 1) solutions/functions that solve problems with the fewest number of keystrokes, and 2) code that’s readable, explicit, and easy to follow.
dplyr is part of the core tidyverse packages, so we install and load this meta-package below.
install.packages("tidyverse")
library(tidyverse)
library(fivethirtyeight)
library(inspectdf)
library(starwarsdb)
library(dm)
I’ll be using the namespace syntax (package::function()) just to make it explicitly clear which function belongs to which package, but this isn’t necessary if you’ve run the code block above.
Whenever I’m performing a series of joins, I like to use the inspectdf package. This package has a series of functions for viewing all the columns in a table by various types:
inspectdf::inspect_cat(): display categorical variables (missing values are displayed by color)
inspectdf::inspect_cor(): display the correlation between numerical variables
inspectdf::inspect_imb(): display imbalances between variables
inspectdf::inspect_mem(): display the memory usage by column
inspectdf::inspect_na(): display the rate of missingness by column
inspectdf::inspect_num(): display the distribution of the numeric columns.
inspectdf::inspect_types(): display the column types.
inspectdf::show_plot(): show the plot from the inspect_* function
text_labels = TRUE, col_palette = 1 (for the colorblind friendly palette).There are three principles for tidy data:
Joins give us the ability to combine multiple datasets on a common column. We’re going to be using the starwarsdb package, which dplyr::contains data from the Star Wars API.
The code below creates and displays the data model for the tables in this package
sw_data_model <- dm::dm_draw(dm = starwarsdb::starwars_dm(),
graph_name = "sw_data_model",
view_type = "all")
sw_data_model
As you can see from the graph above, there are nine tables in the starwarsdb. The tables connect to each other through a series of common columns (called keys), and these allow us to perform multiple kinds of joins.
Load the starwarsdb::films dataset, removing the opening_crawl column, and filter to only episodes 4-6. Store these data in sw_flms
sw_flms <- starwarsdb::films |>
dplyr::select(-c(opening_crawl, director, producer)) |>
dplyr::filter(episode_id >= 4 & episode_id <= 6)
sw_flms
# A tibble: 3 × 3
title episode_id release_date
<chr> <int> <date>
1 A New Hope 4 1977-05-25
2 The Empire Strikes Back 5 1980-05-17
3 Return of the Jedi 6 1983-05-25
left_joinA left_join keeps all of x, and joins it to all matching rows from dataset y
We want to left-join sw_flms to starwarsdb::films_vehicles:
starwarsdb::films_vehicles
# A tibble: 104 × 2
title vehicle
<chr> <chr>
1 A New Hope CR90 corvette
2 A New Hope Star Destroyer
3 A New Hope Sentinel-class landing craft
4 A New Hope Death Star
5 A New Hope Millennium Falcon
6 A New Hope Y-wing
7 A New Hope X-wing
8 A New Hope TIE Advanced x1
9 The Empire Strikes Back Star Destroyer
10 The Empire Strikes Back Millennium Falcon
# … with 94 more rows
We can do this on "title", then we can re-organize the columns using dplyr::select() to place the vehicle column before dplyr::everything() else.
dplyr::left_join(x = sw_flms,
y = starwarsdb::films_vehicles,
by = "title") |>
dplyr::select(vehicle,
dplyr::everything())
# A tibble: 47 × 4
vehicle title episode_id release_date
<chr> <chr> <int> <date>
1 CR90 corvette A New Hope 4 1977-05-25
2 Star Destroyer A New Hope 4 1977-05-25
3 Sentinel-class landing craft A New Hope 4 1977-05-25
4 Death Star A New Hope 4 1977-05-25
5 Millennium Falcon A New Hope 4 1977-05-25
6 Y-wing A New Hope 4 1977-05-25
7 X-wing A New Hope 4 1977-05-25
8 TIE Advanced x1 A New Hope 4 1977-05-25
9 Sand Crawler A New Hope 4 1977-05-25
10 T-16 skyhopper A New Hope 4 1977-05-25
# … with 37 more rows
Assign to sw_flms_veh
dplyr::left_join(x = sw_flms,
y = starwarsdb::films_vehicles,
by = "title") |>
dplyr::select(vehicle,
dplyr::everything()) -> sw_flms_veh
Let’s take a look at the sw_flms_veh table:
sw_flms_veh |> dplyr::glimpse()
Rows: 47
Columns: 4
$ vehicle <chr> "CR90 corvette", "Star Destroyer", "Sentinel-cl…
$ title <chr> "A New Hope", "A New Hope", "A New Hope", "A Ne…
$ episode_id <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5,…
$ release_date <date> 1977-05-25, 1977-05-25, 1977-05-25, 1977-05-25…
We can see this table has 4 columns and 47 rows. We’ll use the inspect_types() function below to view the breakdown of the columns by type:
sw_flms_veh |>
inspectdf::inspect_types() |>
inspectdf::show_plot(
text_labels = TRUE, col_palette = 1)
We can see most of these columns are character. Below we’ll display the categorical variables using inspectdf::inspect_cat():
sw_flms_veh |>
inspectdf::inspect_cat() |>
inspectdf::show_plot(
text_labels = TRUE, col_palette = 1)
sw_flms_veh |>
inspectdf::inspect_num() |>
inspectdf::show_plot(
text_labels = TRUE, col_palette = 1)
inner_joinThe inner_join() joins dataset x and y, and keeps only matching rows from both.
The code below uses an inner_join to join sw_films_veh to starwarsdb::pilots on "vehicle", and assigns it to sw_flms_veh_plt
dplyr::inner_join(x = sw_flms_veh,
y = starwarsdb::pilots,
by = "vehicle") -> sw_flms_veh_plt
sw_flms_veh_plt |> dplyr::glimpse()
Rows: 39
Columns: 5
$ vehicle <chr> "Millennium Falcon", "Millennium Falcon", "Mill…
$ title <chr> "A New Hope", "A New Hope", "A New Hope", "A Ne…
$ episode_id <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5,…
$ release_date <date> 1977-05-25, 1977-05-25, 1977-05-25, 1977-05-25…
$ pilot <chr> "Chewbacca", "Han Solo", "Lando Calrissian", "N…
sw_flms_veh_plt |>
inspectdf::inspect_types() |>
inspectdf::show_plot(
text_labels = TRUE, col_palette = 1)
sw_flms_veh_plt |>
inspectdf::inspect_cat() |>
inspectdf::show_plot(
text_labels = TRUE, col_palette = 1)
right_joinThe right_join() includes all the rows in dataset y, and joins it the matching rows in dataset x
The code below uses a right_join() to join sw_flms_veh_plt to starwarsdb::films_planets on "title", keeping all the rows in starwarsdb::films_planets.
Then we use select() to rename planet to flm_planet and assign this to sw_flms_veh_plt_plnt
dplyr::right_join(
x = sw_flms_veh_plt,
y = starwarsdb::films_planets,
by = "title") %>%
dplyr::select(
vehicle:release_date,
film_planet = planet,
pilot) -> sw_flms_veh_plt_plnt
sw_flms_veh_plt_plnt |> dplyr::glimpse()
Rows: 183
Columns: 6
$ vehicle <chr> "Millennium Falcon", "Millennium Falcon", "Mill…
$ title <chr> "A New Hope", "A New Hope", "A New Hope", "A Ne…
$ episode_id <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
$ release_date <date> 1977-05-25, 1977-05-25, 1977-05-25, 1977-05-25…
$ film_planet <chr> "Tatooine", "Alderaan", "Yavin IV", "Tatooine",…
$ pilot <chr> "Chewbacca", "Chewbacca", "Chewbacca", "Han Sol…
sw_flms_veh_plt_plnt |>
inspectdf::inspect_types() |>
inspectdf::show_plot(
text_labels = TRUE, col_palette = 1)
sw_flms_veh_plt_plnt |>
inspectdf::inspect_cat() |>
inspectdf::show_plot(text_labels = TRUE,
col_palette = 1)
anti_joinThe anti_join() keeps all rows in dataset x that do not have a match in dataset y.
Note the last join created empty values for vehicle, episode_id, release_date, and pilot:
sw_flms_veh_plt_plnt |>
dplyr::filter(is.na(episode_id)) |>
inspectdf::inspect_na() |>
inspectdf::show_plot(text_labels = TRUE,
col_palette = 1) +
coord_flip()
The code below uses the anti_join() to join sw_flms_veh_plt_plnt to sw_films and create a dataset with only title and film_planet:
dplyr::anti_join(x = sw_flms_veh_plt_plnt,
y = sw_flms,
by = c("title", "episode_id", "release_date")) |>
dplyr::select(title, film_planet) |>
inspectdf::inspect_types() |>
inspectdf::show_plot(text_labels = TRUE, col_palette = 1)
dplyr::anti_join(x = sw_flms_veh_plt_plnt,
y = sw_flms,
by = c("title", "episode_id", "release_date")) |>
inspectdf::inspect_na() |>
inspectdf::show_plot(
text_labels = TRUE, col_palette = 1) +
ggplot2::coord_flip()
These should only contain the titles not in sw_films.